Conversation

@Dref360 (Contributor) commented Oct 29, 2018

Summary

This PR solves an issue where a worker dies and the Pool is not told (OOM, pkill, etc.).

This puts a timeout on the future and notifies the user of the issue.
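Roughly, the idea is the following (a minimal sketch, not the exact diff; `get_with_timeout` is an illustrative helper name):

```python
import multiprocessing as mp
import warnings

def get_with_timeout(future, timeout=30):
    """Fetch a computed sample; warn instead of hanging forever
    if the worker that owned it has died."""
    try:
        # `future` is the AsyncResult returned by Pool.apply_async.
        return future.get(timeout)
    except mp.TimeoutError:
        warnings.warn('An input could not be retrieved.'
                      ' It could be because a worker has died.',
                      UserWarning)
        return None
```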

Discussion

What should we do with those samples in the case of a Sequence? Should we re-queue the task, compute the sample directly, or just drop it?

Related Issues

PR Overview

  • This PR requires new unit tests [y/n] (make sure tests are included)
  • This PR requires documentation updates [y/n] (make sure the docs are up-to-date)
  • This PR is backwards compatible [y/n]
  • This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@gabrieldemarmiesse (Contributor) commented Oct 30, 2018

If I understand correctly (I'm by no means an expert in multithreading/multiprocessing), in the implementation proposed in this PR, the batch is dropped and a warning is raised. Is that right?

If it's not too much trouble to implement, I believe the best option is to recompute the sample and raise a warning (see the sketch after this list), for these reasons:

  • We don't take the risk of breaking things, since other Keras components might crash or behave in a weird way if the Sequence doesn't yield the right number of batches.

  • We keep the current behavior. It's also the behavior users expect.

  • Changing the order of the batches can leave users wondering what is going on (silent unexpected behavior) if they overlook the warning, though this is unlikely to happen in practice.

  • If the crash is caused by a sample and is deterministic, the user is interested in the stacktrace. If we recompute it in the main process/thread, the stacktrace will be displayed.

  • On the downside, we'll incur a performance penalty.

I might be wrong on certain points, as I don't know much about all this.
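To make the suggestion concrete, here is a rough sketch of the recompute-and-warn fallback (a hypothetical helper, not code from this PR; it assumes the future carries its batch index, as in this PR, and that the Sequence is reachable from the main process):

```python
import multiprocessing as mp
import warnings

def get_batch(future, sequence, timeout=30):
    try:
        return future.get(timeout)
    except mp.TimeoutError:
        warnings.warn('The input {} could not be retrieved.'
                      ' Recomputing it in the main process.'.format(future.idx),
                      UserWarning)
        # Recompute synchronously: slower, but the epoch still yields the
        # expected number of batches in the expected order, and a
        # deterministic crash will now surface its stacktrace directly.
        return sequence[future.idx]
```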

```diff
+ future.idx = i
  self.queue.put(
-     executor.apply_async(get_index, (self.uid, i)), block=True)
+     future, block=True)
```
Review comment (Contributor):
I don't think the line needs to be broken.

"An input could not be retrieved."
" It could be because a worker has died."
"We do not have any information on the lost sample."
.format(),
Review comment (Contributor):

I don't believe the `.format()` call is useful here.

Review comment (Collaborator):

+1, maybe you intended to display the index in the warning?

Review comment (Contributor, PR author):

Yeah, but we cannot know the index in a generator.

@fchollet (Collaborator) left a comment:

I am curious about the interaction with fit_generator when batches are dropped. Does `while steps_done < steps_per_epoch:` just complete fine? Are other batches drawn at random instead of the dropped ones?

```python
except mp.TimeoutError:
    idx = future.idx
    warnings.warn(
        "The input {} could not be retrieved."
```
Review comment (Collaborator):

Nit: use ' as quote character for consistency.

Same below.

"An input could not be retrieved."
" It could be because a worker has died."
"We do not have any information on the lost sample."
.format(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, maybe you intended to display the index in the warning?

@Dref360 (Contributor, PR author) commented Nov 19, 2018

As of now, there is no way to inform fit_generator to skip an index, so every step will be offset.

Quick note that the "real" fix for Travis is to use spawn instead of fork. The issue is that it's only doable with Python 3.
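For reference, switching the start method would look roughly like this (Python 3 only, since `multiprocessing.get_context` was added in Python 3.4):

```python
import multiprocessing

# 'spawn' starts each worker in a fresh interpreter instead of fork()ing
# the parent process, so workers do not inherit the parent's (possibly
# large) memory footprint, which should help with the OOM cases.
ctx = multiprocessing.get_context('spawn')
pool = ctx.Pool(processes=4)
```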

@fchollet (Collaborator) commented
> Quick note that the "real" fix for Travis is to use spawn instead of fork. The issue is that it's only doable with Python 3.

I think it would be acceptable to only test this functionality with Python 3. Our #1 priority should be to make CI reliable and fast.

@gabrieldemarmiesse (Contributor) commented

Currently, most of the timeouts happen with CNTK and Python 3.6, so fixing the timeouts for 3.6 would be a very good first step.

@Dref360 (Contributor, PR author) commented Nov 27, 2018

Now the multiprocessing tests on Sequences run on all backends.
Also, the index is fetched after the timeout; I think this is the behaviour users would expect.

We do see some GitHub issues where training stops (when using HDF5 and/or OOM).
We could force the OrderedEnqueuer to use spawn when available, which would solve some of these issues (the OOM ones).

@gabrieldemarmiesse (Contributor) left a comment:

A few improvements to the tests are possible.

Since I don't know much about multiprocessing, a second review (maybe from @fchollet) would be beneficial.

```python
for _ in range(11):
    next(gen_output)
assert "The input {} could not be retrieved.".format(
    missing_idx) in str(w[-1].message)
```
Review comment (Contributor):
The test will fail if other warnings are emitted. I would recommend using pytest's utility to check that warnings are triggered correctly: https://docs.pytest.org/en/latest/warnings.html#warns
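For instance, something along these lines (a sketch reusing the test's own names, and assuming the warning is a UserWarning):

```python
import pytest

# pytest.warns fails the test if no matching warning is raised, and it
# is not confused by unrelated warnings emitted at the same time.
with pytest.warns(UserWarning, match='could not be retrieved'):
    for _ in range(11):
        next(gen_output)
```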

```python
warnings.simplefilter("always")
for _ in range(4 * missing_idx):
    next(gen_output)
assert 'An input could not be retrieved.' in str(w[-1].message)
```
Review comment (Contributor):

Same here.

@gabrieldemarmiesse (Contributor) commented

@Dref360 I updated the PR.

@fchollet could you review? Thank you.

@gabrieldemarmiesse (Contributor) commented

Ping @fchollet could you review when you have the time? Thank you.

@fchollet (Collaborator) left a comment:

LGTM as far as I can tell. Shall we merge it?

@Dref360 (Contributor, PR author) commented Dec 17, 2018

I think this is LGTM; it gives the user "some" information, and we can move forward with using "spawn" when possible on Travis.

@gabrieldemarmiesse (Contributor) commented

LGTM.

@gabrieldemarmiesse merged commit ca802e1 into keras-team:master on Dec 17, 2018
@gabrieldemarmiesse (Contributor) commented

Thanks @Dref360 for working on this!
